9/29/2018

Potentially relevant bits of my story

  • Software Engineer and lazy in all the right ways
  • Worked with some Expert Systems folks for a couple years at a place known for decision-making
  • The inability of Expert Systems to deal with uncertainty bothered me to no end
  • Took time off to do Recurse Center and dive in
  • Realized the majority of people trying to solve problems like this were actaully doing Bayesian modeling

How do we make decisions?

  • Try to understand the state of the world
  • Guess at the processes through which the world reacts to stimuli
  • Associate utility scores with possible world states, and model relative chance
  • Often, communication of findings is key

Example problem - mail delivery

  • I get a few items of mail most days, but some days I get 0
  • How often should I be experiencing no-mail days?
  • I suspect the USPS may have an efficiency policy in place that suggests skipping some postal customers each day
  • I could use statistical analysis to make a case to a reporter that it is worth investigating
  • Imagine I care a lot about the mail and that the reporter is very busy

Why is it hard to assign utilities to actions?

  • Hard to represent the state of our knowledge about the world and how it changes
    • We have uncertainty and ambiguity around even the most deterministic of processes
    • And limited resources to discover the true process
  • Many possible models for a process
    • We will never know the True Model
    • Nor the world’s True State

Mail delivery uncertainty

  • Mail is easy to measure, but the process that generates mail is incredibly complex
    • For example, in a moment of weakness I used some expiring airline miles to subscribe to Vanity Fair
    • We don’t have time to model my weakness as a function of time and airline miles
  • A Poisson distribution captures much of the behavior we care about, but
    • Exactly how well does it describe our data?
    • Want to keep track of our uncertainty when making predictions or decisions
    • We suspect someone is messing with our Poisson
  • Visually explain to our reporter friend in terms they find compelling

What does frequentism preach?

  • Lots of totally valid frequentist math and procedures
    • Typically quite difficult
  • Instead, set up your experiment correctly and then use some off-the-shelf test or estimator to get The Best Answer (point estimate)
    • Oh, and make sure this list of assumptions all hold
  • Alternatively, do a ton of math to prove asymptotic gaurantees specific to your model
  • No parameter uncertainty, because probability only represents long-run frequency
    • This actually matters when thinking about and communicating results
  • “No” priors; incorporate minimal expert knowledge

What does machine learning preach?

  • We have computers now, calculus can show itself out
    • fair enough
  • Great success with algorithms that have only later had statistical interpretations ascribed to them
  • Progress = predictions 0.01% better than the last paper according to cross validation
    • Let’s ignore that the noise around cross-validated accuracy scores is much higher than 0.01%
  • Why would we need to interpret the model if its predictions are accurate?

What does Bayesianism preach?

  • Philosophically, inference without a prior is impossible
    • In frequentism it’s buried in an assumption
    • In ML you are not really sure if there is a deeper meaning under the code in the first place, but there is some implied prior
  • Retain a distribution (as opposed to point estimates like the mean) as long as possible before making a decision
  • We also (only recently?) have computers!
    • We can actually express an accurate, bespoke model for our data generating process without worrying about an analytic integral
  • Construct many models
  • Always Be Graphing

What are we modeling?

  • “If we knew our parameter values \(\theta\), how would data \(y\) be generated?” \[\pi(y \, | \, \theta)\]
  • But we want to learn what our data implies about \(\theta\)
  • \[\pi(\theta \, | \, y) = \frac{\pi(y \, | \, \theta) \pi(\theta)}{\pi(y)}\]
  • In our case, \(\theta\) might be the average amount of mail we get on any given day, or the percentage of days the carrier skips

Introducing Stan

  • Stan is an imperative probabilistic programming language
    • in the vein of BUGS, but can do arbitrary computation
  • A Stan program defines a probability model
    • declares data and (constrained) parameter variables
    • all in order to define a log posterior density function
  • Stan comes with a few inference algorithms
    • MCMC (HMC) for full Bayesian inference
    • VB for (one form of) approximate Bayesian inference
    • MLE for penalized maximum likelihood estimation

Why choose Stan?

  • Expressive
    • Stan is a full imperative programming language
    • Easier to communicate models
  • Robust
    • usually works; signals when it doesn’t
  • Efficient
    • effective sample size per unit time winner
    • 3 forms of parallelism coming out this year (CPU, GPU, cluster)
  • Great documentation, case studies, and community!

What can our mail data tell us?

  • When in doubt, simulate
  1. Draw a model w/ parameters from the priors over models and parameters
  2. fit the two models
  3. choose the model with the highest density (assuming models have equal probability)
  4. classify model choice as correct, false positive, or false negative
  5. count it up

Normal mail delivery Stan model

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  mail ~ poisson(avg_mail_per_day);
}

Fitting a Stan model with PyStan

import pystan, numpy as np
model = pystan.StanModel("usps1.stan")
data = dict(mail = np.array(r["mail"], dtype="int"), num_days=int(r["num_days"]))
fit = model.sampling(data)
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%     50%  97.5%  n_eff   Rhat
avg_mail_per_day   4.96  2.3e-3   0.08    4.8    4.96   5.11   1316    1.0

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Plotting the fit

from matplotlib import pyplot as plt
plt.hist(fit.extract()["avg_mail_per_day"], bins=20)

How do we interpret the fit summary?

  • Low sd
  • Good diagnostics (Rhat < 1.1, no divergences or other warnings)
    • For more on diagnostics and proper workflow, go to @betanalpha’s master class tomorrow!
  • Assuming this model is true, we seem pretty confident that I get about 5 pieces of mail a day.

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"] # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate) #4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])

Add skip_percentage to model skipped days

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density;
  }
}

Fit model with skip_percentage

Inference for Stan model: anon_model_7abb4b18761050015bf05aee642e2a30.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.54  2.0e-3   0.09   5.48   5.54   5.73   2231    1.0
skip_percentage    0.99  2.4e-4   0.01   0.98   0.99    1.0   2815    1.0
lp__              -1581    0.03   1.06  -1581  -1580  -1580   1535    1.0

Samples were drawn using NUTS at Sat Oct 13 16:17:55 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Simulate data implied by parameter draws

mail_rate15 = fit15.extract()["avg_mail_per_day"]
skip_percentage = fit15.extract()["skip_percentage"]
fake_data15 = np.zeros(len(mail_rate))
for i in range(len(mail_rate15)):
    todays_mail =  np.random.poisson(mail_rate15[i])
    if np.random.binomial(1, skip_percentage[i]):
        fake_data15[i] = 0
    else:
        fake_data15[i] = todays_mail

Visually compare simulations with observed data


Fix our math; intro log scales

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += bernoulli_lpmf(0 | skip_percentage) + without_skipping_density;
  }
}

Fixed fit

Inference for Stan model: anon_model_387876dd9af8a07353d37466d2bde6fb.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.52  1.6e-3   0.09   5.46   5.52    5.7   3350    1.0
skip_percentage     0.1  2.0e-4   0.01    0.1    0.1   0.13   3072    1.0
lp__              -1821    0.02   0.97  -1821  -1821  -1820   1884    1.0

Samples were drawn using NUTS at Sat Oct 13 13:45:07 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

PPC plot for corrected model


Pure poisson (left) vs. with skipped days (right)


A typical non-Bayesian approach

  • Plug the model (sans prior) and data into an estimator like maximum likelihood (w/ implied uniform prior)
  • Get out the mode and perform a p-test - are the results significant?

What is wrong with optimization and point estimates?

Area in higher dimensions is weird


(from Betancourt’s Conceptual Introduction to HMC)

Area in higher dimensions is weird


(from Betancourt’s Conceptual Introduction to HMC)

Curse of dimensionality

Curse of dimensionality

  • “Somewhat counterintuitively, the average member of a population is an outlier. How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.” -Carpenter
  • Recall the Airforce’s attempt to design a cockpit for the average pilot

Curse of dimensionality


(from Mackay’s book)

How Stan’s HMC is different from optimizing